Implementation: Sarsamax

The pseudocode for Sarsamax (or Q-learning) can be found below.
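As a rough companion to the pseudocode, the core of Sarsamax is the update Q(S_t, A_t) <- Q(S_t, A_t) + alpha * (R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t)), applied after every step while actions are chosen epsilon-greedily. The sketch below is one possible way to write that loop in Python; it is not the notebook's solution. The function names (`epsilon_greedy`, `sarsamax_update`, `q_learning`), the default hyperparameters, and the toy corridor environment (`corridor_reset`, `corridor_step`) are all assumptions made for illustration, and the notebook's own environment and function signatures may differ.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    values = Q[state]
    return max(range(n_actions), key=lambda a: values[a])


def sarsamax_update(Q, state, action, reward, next_state, done, alpha, gamma):
    """Apply one Sarsamax (Q-learning) update to Q[state][action] in place."""
    # The target bootstraps from the best action value in the next state,
    # regardless of which action the behavior policy will actually take.
    best_next = 0.0 if done else max(Q[next_state])
    target = reward + gamma * best_next
    Q[state][action] += alpha * (target - Q[state][action])


def q_learning(env_reset, env_step, n_actions, num_episodes,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    """Run Sarsamax for num_episodes and return the learned action values."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(num_episodes):
        state = env_reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, n_actions, epsilon)
            next_state, reward, done = env_step(state, action)
            sarsamax_update(Q, state, action, reward, next_state, done,
                            alpha, gamma)
            state = next_state
    return Q


# Hypothetical toy environment (not from the notebook): a corridor with
# states 0..4. Action 0 moves left, action 1 moves right, every step costs
# a reward of -1, and the episode ends on reaching state 4.
GOAL = 4


def corridor_reset():
    return 0


def corridor_step(state, action):
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL
```

Calling `q_learning(corridor_reset, corridor_step, n_actions=2, num_episodes=2000)` on this toy corridor should yield action values whose greedy action is "right" (action 1) in every non-terminal state. Keeping the update in its own function makes the difference from Sarsa easy to see: only the `max` over next-state values changes.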

Sarsamax is guaranteed to converge to the optimal action-value function under the same conditions that guarantee convergence of Sarsa.

Please use the next concept to complete Part 3: TD Control: Q-learning of Temporal_Difference.ipynb. Remember to save your work!

If you'd like to reference the pseudocode while working on the notebook, you are encouraged to open this sheet in a new window.

Feel free to check your solution by looking at the corresponding section in Temporal_Difference_Solution.ipynb.